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PARALLEL  PARTITIONING  STRATEGIES  FOR  THE 
ADAPTIVE  SOLUTION  OF  CONSERVATION  LAWS  * 

KAREN  D.  DEVINE*,  JOSEPH  E.  FLAHERTY* ,  RAYMOND  M.  LOY*  AND 

STEPHEN  R.  WHEATS 


Abstract.  We  describe  and  examine  the  performance  of  adaptive  methods  for  solv¬ 
ing  hyperbolic  systems  of  conservation  laws  on  massively  parallel  computers.  The  differ¬ 
ential  system  is  approximated  by  a  discontinuous  Galerkin  finite  element  method  with  a 
hierarchical  Legendre  piecewise  polynomial  basis  for  the  spatial  discretization.  Fluxes  at 
element  boundaries  are  computed  by  solving  an  approximate  Riemann  problem;  a  pro¬ 
jection  limiter  is  applied  to  keep  the  average  solution  monotone;  time  discretization  is 
performed  by  Runge-Kutta  integration;  and  a  p-reftnement-based  error  estimate  is  used 
as  an  enrichment  indicator.  Adaptive  order  (p-)  and  mesh  ( h -)  refinement  algorithms  are 
presented  and  demonstrated.  Using  an  element-based  dynamic  load  balancing  algorithm 
called  tiling  and  adaptive  p-refinement,  parallel  efficiencies  of  over  60%  are  achieved  on 
a  1024-processor  nCUBE/2  hypercube.  We  also  demonstrate  a  fast,  tree-based  parallel 
partitioning  strategy  for  three-dimensional  octree-structured  meshes.  This  method  pro¬ 
duces  partition  quality  comparable  to  recursive  spectral  bisection  at  a  greatly  reduced 
cost. 

Key  words.  Adaptive  methods,  hyperbolic  systems  of  conservation  laws,  massively 
parallel  computation,  Galerkin  finite  element  method,  h-refinement,  p-refinement,  load 
balancing,  tiling,  domain  decomposition,  octree-derived  meshes. 

AMS(MOS)  subject  classifications.  65M20,  65M50,  65M60. 

1.  Introduction.  Adaptive  finite  difference  and  finite  element  meth¬ 
ods,  which  automatically  refine  or  coarsen  meshes  and  vary  the  order 
of  accuracy  of  the  numerical  solution,  offer  greater  robustness  and  com¬ 
putational  efficiency  than  traditional  methods.  High-order  methods  and 
the  combination  of  mesh  refinement  and  order  variation  (/ip-refinement) 
have  been  shown  to  produce  effective  solution  techniques  for  elliptic  [7,28] 
and  parabolic  [2,3,10,26]  problems.  With  few  exceptions  [11,16],  work  on 
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adaptive  methods  for  hyperbolic  systems  has  concentrated  on  /i-refinement 
[5,8,12]. 

Distributed-memory,  massively  parallel  computers  have  enabled  the 
development  of  applications  requiring  computational  resources  previously 
unavailable.  Finite  difference  and  finite  element  methods  for  structural 
mechanics  and  fluid  dynamics  problems,  for  example,  often  require  mil¬ 
lions  of  degrees  of  freedom  to  accurately  simulate  physical  phenomenon. 
When  solving  partial  differential  equations  (PDEs)  on  MIMD  computers, 
spatial  data  must  be  distributed  across  the  processors’  memory  while  min¬ 
imizing  the  amount  of  data  that  must  be  exchanged  between  processors. 
This  problem  is  especially  acute  when  dealing  with  (i)  adaptive  methods, 
where  mesh  structure  and  work  loads  change  during  the  computation,  and 
(«)  three-dimensional  meshes,  whose  data  grow  at  a  faster  rate  than  in  two- 
dimensions  when  performing  /i-refinement.  The  challenge  is  to  combine  the 
computational  efficiency  of  adaptive  methods  with  the  computational  re¬ 
sources  of  massively  parallel  computation. 

We  consider  systems  of  d-dimensional  hyperbolic  conservation  laws  in 
m  variables  having  the  form 

d 

(1.1a)  u«(x,<)  +  ^fi(x,t,u)rj  =  0,  xGD,  t  >  0, 

i=i 

with  the  initial  conditions 

(1.1b)  u(x,  0)  =  u°(x),  xefiu5n, 

and  appropriate  well-posed  boundary  conditions.  The  subscripts  t  and 

i  —  1,2,  denote  partial  differentiation  with  respect  to  time  and 

the  spatial  coordinates,  and  u,  u°,  and  f<,  i  =  1,2,  ...,d,  are  m-vectors 
on  the  problem  domain  Cl  x  (t  >  0).  Finite  difference  schemes  for  (1.1), 
such  as  the  Total  Variation  Diminishing  (TVD)  [33,36]  and  Essentially 
Non-Oscillatory  (ENO)  [31]  methods,  usually  achieve  high-order  accuracy 
by  using  a  computational  stencil  that  enlarges  with  order.  A  wide  sten¬ 
cil  makes  the  methods  difficult  to  implement  on  unstructured  meshes  and 
limits  efficient  implementation  on  massively  parallel  computers.  Finite  ele¬ 
ment  methods,  however,  have  stencils  that  are  invariant  with  method  order, 
allowing  them  to  easily  model  problems  with  complicated  geometries  and 
to  be  efficiently  parallelized. 

We  use  a  discontinuous  Galerkin  finite  element  method  [11,13,14]  where 
the  spatial  approximation  is  continuous  within  an  element,  but  may  be 
discontinuous  at  interelement  boundaries  to  accommodate  solution  discon¬ 
tinuities  more  accurately.  Fluxes  at  element  boundaries  are  computed  by 
solving  an  approximate  Riemann  problem  with  a  projection  limiter  applied 
to  keep  the  average  solution  monotone  near  discontinuities  [11,14,36].  Time 
discretization  is  performed  by  an  explicit  Runge-Kutta  method. 
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The  discontinuous  Galerkin  method  is  well  suited  to  parallelization 
on  massively  parallel  computers.  The  computational  stencil  involves  only 
nearest-neighbor  communication  regardless  of  the  degree  of  the  piecewise 
polynomial  approximation  and  the  spatial  dimension.  Additional  storage  is 
needed  for  only  one  row  of  “ghost”  elements  along  each  edge  of  a  processor’s 
subdomain.  Thus,  the  size  of  the  problem  scales  easily  with  the  number  of 
processors.  Indeed,  for  two-dimensional  problems  on  rectangular  domains 
with  periodic  boundary  conditions,  scaled  parallel  efficiencies  in  excess  of 
97%  are  achieved  [11], 

To  achieve  parallel  efficiency  with  irregular  structures,  parallel  finite 
element  methods  often  use  static  load  balancing  [19,21]  as  a  precursor  to 
obtaining  a  finite  element  solution.  Parallel  efficiency  degrades  substan¬ 
tially  due  to  processor  load  imbalances  with  adaptive  enrichment.  Even 
with  the  lower  parallel  efficiency,  however,  execution  times  for  comparable 
accuracy  are  shorter  with  adaptive  methods  than  for  fixed-order  methods. 

We  have  developed  an  adaptive  p-refinement  method  for  two-dimen¬ 
sional  systems  that  uses  dynamic  load  balancing  to  adjust  the  processor 
decomposition  in  the  presence  of  nonuniform  and  changing  work  loads. 
Tiling  [37]  is  a  modification  of  a  dynamic  load  balancing  technique  devel¬ 
oped  by  Leiss  and  Reddy  [24]  that  balances  work  within  overlapping  pro¬ 
cessor  neighborhoods  to  achieve  a  global  load  balance.  Work  is  migrated 
from  a  processor  to  others  within  the  same  neighborhood.  We  demonstrate 
the  improved  performance  obtained  from  a  combination  of  p-adaptivity  and 
parallel  computation  on  several  examples  using  a  1024-processor  nCUBE/2 
hypercube. 

For  three-dimensional  problems  with  irregular  grids  of  tetrahedral  ele¬ 
ments  and  adaptive  h- refinement,  we  have  developed  a  tree-based  mesh  par¬ 
titioning  technique  that  exploits  the  properties  of  tree-structured  meshes. 
The  rich,  hierarchical  structure  of  these  meshes  allows  them  to  be  divided 
into  components  along  boundaries  of  the  tree  structure.  Our  partitioning 
technique  is  based  on  two  tree  traversals  that  (?)  calculate  the  processing 
costs  of  all  subtrees  of  a  node,  and  (?'?')  form  the  partitions.  Our  method 
is  inexpensive  and,  thus,  has  an  advantage  relative  to  other  global  par¬ 
titioning  techniques  [21,23,27].  We  demonstrate  the  performance  of  the 
tree-based  mesh  partitioning  technique  on  a  variety  of  three-dimensional 
meshes  and  discuss  extension  of  the  technique  for  parallel  implementation 
and  dynamic  load  balancing.  We  present  results,  using  a  Thinking  Ma¬ 
chines  CM-5  computer,  for  the  adaptive  /i-refinement  solutions  of  an  Euler 
flow  past  a  cone. 


2.  The  Discontinuous  Galerkin  Method.  Partition  the  domain 
Q  into  polygonal  elements  Qj,  j  —  1, 2, . . . ,  J,  and  construct  a  weak  form 
of  the  problem  by  multiplying  (1.1a)  by  a  test  function  v  6  L2(Q;)  and 
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integrating  the  result  on  fi;-  to  obtain 


=  0. 

Apply  the  Divergence  Theorem  to  (2.1)  to  obtain 

(2.2)  f  vTu tdT~y2  [  vlmdr  +  T  f  vTf;(u)ni  da  =  0, 
J Clj  ,*_!  Jcij  j  ^  J  dClj 


(2.1) 


/ 

JClj 


vTu(  dr  + 


fi(u))r,  dr 


-J  vl 

JCtj 


f,(u)  dr 


where  n  =  [ni,  n2, . . . ,  n,*]T  is  the  unit  outward  normal  to  dClj.  Approx¬ 
imate  u(x,<)  on  Clj  by  a  p‘A-degree  polynomial  U;(x,<)  €  S j  C  L2(Q.j), 
and  test  against  all  functions  V  £  S j.  With  initial  conditions  determined 
by  local  L 2  projection,  this  approximation  yields  the  ordinary  differential 
system 

(2.3a)  [  VT(Uj){  dr  -  W  Vj^U,)  dr 

+  V  /  VTf,(Uj)n,-  da  =  0,  t  >  0, 

,=i  Jdfij 

(2.3b)  /  VT(UJ  —  u°)  dr  =  0,  t  =  0,  VV  6  Si( 

Jtij 

which  we  solve  on  Slj,  j  =  1,2,...,/,  by  explicit  Runge-Kutta  integra¬ 
tion  of  order  p.  Integrals  are  evaluated  numerically  using  Gauss-Legendre 
quadrature.  A  basis  for  the  local  space  Sj  may  be  defined  using  prod¬ 
ucts  of  one-dimensional  Legendre  polynomials,  as  distinct  from  hierarchical 
bases  for  elliptic  and  parabolic  systems  [34]  which  use  integrals  of  Legendre 
polynomials.  Two-dimensional  bases  involving  tensor  products  of  Legen¬ 
dre  polynomials  have  been  constructed  for  quadrilateral  [11]  and  triangu¬ 
lar  [15]  elements.  A  three-dimensional  basis  for  tetrahedral  elements  could 
follow  procedures  developed  for  elliptic  systems  [34].  Results  presented  in 
Section  5  for  tetrahedral-element  meshes  involve  only  piecewise  constant 
approximations. 

The  normal  component  of  the  flux 

d 

(2.4)  fn(u)  =  ^f<(u)n< 

i=i 

remains  unspecified  on  dtlj  with  (2.3)  since  the  approximate  solution 
is  discontinuous  there.  We  specify  it  using  a  “numerical  flux”  function 
h(ut.u-)  dependent  on  solution  states  U+  and  U“  on  the  inside  and 
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outside,  respectively,  of  dQj.  Several  numerical  flux  functions  are  possi¬ 
ble  [14,31].  In  two  dimensions  [11,15],  we  have  used  the  Lax-Friedrichs 
numerical  flux, 

(2.5a)  h(U+,  UJ)  =  i[fn(Ut)  +  fn(U")  -  |a|(Ur  -  U+)], 

(2.5b)  a  =  max(|A(fnu(Uj))|),  Uf  <  Uj  <  U~ , 

where  A(fnu)  is  an  eigenvalue  of  the  Jacobian  fnu 

In  three  dimensions,  we  use  van  Leer’s  flux  vector  splitting  [25,35]  to 
construct  a  numerical  flux.  This  technique  is  not  generally  applicable  but 
does  apply  to  the  Euler  equations  of  compressible  flow  which  have  the 
solution  and  flux  vectors 

u'  =  [p,  pu' ,  pv' ,  pw' ,  ef  , 

(2  6)  fl^U^  =  [pu'’pu>2  +  P>Pu't;',/>u'u/,u/(e  +p)]T , 

f2(u')  =  [pv',pu'v',pv'2  +  p,pv'w',v'(e  +p)]T  , 

fa(u')  =  [pw1,  pu1  w' ,  pv1  w' ,  pw'2  +p,  w'(e  +p)]T  , 

where  p,  e,  and  p  are  the  density,  energy,  and  pressure;  u'  is  the  velocity 
component  in  the  direction  of  n;  and  v'  and  w'  are  velocity  components 
tangent  to  <9f .  The  numerical  flux  h  on  <9Q;  is  computed  as 

(2.7a)  h(U'“,  U'+)  =  f+(U'+)  +  ff(U'") 

where 

12  7b)  fi+(U')  =  fi(U')’  fr(u')  =  0.  if  M'  >  1; 

1  ’  fi+(U')  =  0,  ff(U')  =  fi(U'),  if  M'  <  — 1; 


1 

o  [(7  -  1)M' ±  2] /t 

ff  (IT)  =  ±p<M'fl?  v\  ,  if  |M'|  <  1. 

W 

a3[(7-l)M'±2]:‘  , 

.  2(75-l)  2 

(2.7c) 

where  M'  —  U'/a,  and  a  is  the  local  speed  of  sound.  Van  Leer’s  numerical 


flux  has  the  property  that  the  two  components  of  the  normal  flux,  fx+  and 
ff ,  depend  only  on  the  solution  on  the  same  side  of  dQj .  Therefore,  the  flux 
may  be  calculated  by  computing  each  component  separately,  exchanging 
f*  and  ff  between  elements,  and  summing.  This  splitting  approximately 
halves  the  computational  and  parallel  communications  effort  relative  to 


other  flux  evaluation  schemes. 
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In  regions  where  the  numerical  solution  is  smooth,  the  discontinuous 
Galerkin  method  produces  the  0(hp+1),  h  =  max^i^,...,^  j=i,2,...,j(A-Xi,j), 
convergence  expected  in,  e.g.,  Ll  for  a  ptA-degree  approximation  [11,14]. 
To  prevent  spurious  oscillations  that  develop  near  discontinuities  with  high- 
order  methods,  we  have  developed  a  projection  limiter  that  limits  solution 
moments  [11,14,36].  Using  a  one-dimensional  ( d  =  1)  scalar  problem  and 
the  Legendre  polynomial  basis 

p 

(2.8)  Uj(Z,t)  =  '£cjl(t)Pl(Z) 

1=0 

as  an  illustration,  the  coefficient  Cjk  is  proportional  to  the  kth  moment  Mjk 
of  Uj]  i.e., 

^2  9)  ■Mjk  ~  J  Uj{£>t)Pk(£)  d£  =  2k  +  l°^k' 

F  =  0,l,...,p- 1,  j  =  1,2, . . ., <7. 

Thus,  to  keep  Mjk  monotone,  we  must  keep  Cjk  monotone  on  neighboring 
elements,  which  we  do  by  specifying 

(2*  +  l)cj,k+i  = 


(2.10a)  minmod(( 2k  +  l)cj,fc+i,Cj+iiJt  -Cjik)Cjik  —  Cj_x,*), 

where 

minmod(a ,  6,  c)  = 

(2  10b)  f  sign(a)min(|a|,|6|,|c|),  if  sign(a)  =  sign(6)  =  sign(c) 

'  [0,  otherwise. 

The  limiter  (2.10)  is  applied  adaptively.  First,  the  highest-order  coeffi¬ 
cient  CjP  is  limited.  Then  the  limiter  is  applied  to  successively  lower-order 
coefficients  whenever  the  next  higher  coefficient  on  the  interval  has  been 
changed  by  the  limiting.  The  higher-order  coefficients  are  re-limited  using 
the  updated  low-order  coefficients  when  necessary.  In  this  way,  the  limit¬ 
ing  is  applied  only  where  it  is  needed,  and  accuracy  is  retained  in  smooth 
regions.  For  two-  and  three-dimensional  problems,  the  one-dimensional 
limiter  is  applied  in  the  direction  n  normal  to  dClj . 

For  vector  systems,  the  scalar  limiting  is  applied  to  the  characteristic 
fields  of  the  system  [13].  The  diagonalizing  matrices  T(u)  and  T-1(u) 
(consisting  of  the  right  and  left  eigenvectors  of  the  Jacobian  fnu)  are  eval¬ 
uated  using  the  average  values  of  Uj,  j  =  1,2,  on  Qj ;  the  scalar 

limiting  is  applied  to  each  field  of  the  characteristic  vector;  and  the  result 
is  transformed  back  to  physical  space  by  post-multiplication  by  T-1(U;). 
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3.  Adaptive  p- Refinement.  We  have  developed  an  adaptive  p-re¬ 
finement  version  of  the  two-dimensional  method  (2.3,  2.5,  2.10)  using  rect¬ 
angular  grids  and  a  method-of-lines  framework.  A  spatial  error  estimate  is 
used  to  control  order  variation  procedures  that  attempt  to  keep 

j 

(3.1a)  e(t)  =  '$2ei(t)  <  £, 

j= i 


where  e  is  prescribed  and 


(3.1b)  ej(t)  =  ||  /  |u(x,t)-Ui(x,t)|dr||00. 

JClj 

Control  is  done  locally  with  a  goal  of  maintaining 

(3.2)  ej(t)  <  TOL  =  j,  j  =  l,2,...,J. 

We  initialize  Uj(x,  0),  j  =  1, 2, to  the  lowest-degree  polynomial  sat¬ 
isfying  (3.2)  at  t  =  0.  For  times  t  >  0,  we  use  p-refinement  to  calculate  an 
estimate  Ej(t)  of  ej  as 

(3.3)  Ej  =  \\  f  lUf^-U^rlU  j  =  1,2, . . .  ,J, 

J  n, 

where  U?  is  the  pth-de gree  finite  element  approximation  of  u.  While  this 
estimate  is  computationally  expensive,  it  is  still  less  expensive  than  h- 
refinement  (Richardson’s  extrapolation)  techniques  and  can  be  used  to  re¬ 
duce  the  effort  involved  in  recomputing  Uj  and  its  error  estimate  when 
p-refinement  is  needed. 

A  less  expensive  error  estimation  procedure  similar  to  successful  proce¬ 
dures  for  elliptic  and  parabolic  systems  [4,6]  can  be  obtained  by  the  (p+l)’*- 
degree  polynomial  correction  to  a  ptA-degree  solution  while  making  use  of 
superconvergence  to  reduce  complexity.  We  construct  a  (p  +  l^-dggree 
correction  term  Ky(x,t)  whose  roots  are  the  superconvergence  points  of 
the  approximation,  and  then  estimate  ej  as 

(3.4a)  Ej(t)  =  ||  [  IK^x.^lcfrHoo,  j  =  1,2, . . .,  J. 

For  p  =  0,  the  superconvergence  points  remain  at  the  Legendre  roots,  but 
for  p  >  0,  the  superconvergence  points  move  toward  the  Radau  points 
[1,11],  i.e.,  the  roots  of 


Rp+i 


Pp+i(0-Pp  (0. 
Pp+i(t)  +  Pp{  0, 
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if  fnu  >  0, 
if  fnu  <  0, 


(3.4b) 


as  t  increases.  Then,  for  a  two-dimensional  approximation  using  a  basis  of 
tensor  products  of  Legendre  polynomials  on  rectangular  elements, 


ciu(t)Pii£)PM 

+cjio(t)Pl(Opo{v) 

+Cjoi(t)P0{OPi(ri),  ifp  =  0 


(3.4c)  K 


Cj,p+i,p+i  (t)Rp+i  (£)Rp+i(ri) 

p 

k  ,p+l{t)  Pk(0  Pp+l(v) 

+ej,p+iAt)Rp+i(Z)Pk{v)),  if  p  >  o. 


To  compute  Kj(x,f),  let  Uj  =  Uj  +  Kj,  j  =  1, 2, . . J,  substitute  Uj  into 
(2.3),  and  solve  for  the  coefficients  of  K;(x,t)  with  Uj  fixed. 

To  compute  Ej  using  (3.4),  we  solve  2p  +  3  additional  ordinary  dif¬ 
ferential  equations  in  two  dimensions,  compared  to  an  additional  ( p  +  2)2 
differential  equations  required  for  (3.3).  The  movement  of  the  superconver¬ 
gence  points  from  the  Legendre  points  at  t  =  0  toward  the  Radau  points 
for  t  >  0  is  gradual,  occurring  over  several  time  steps  [11].  Thus,  the 
effectiveness  of  the  estimate  improves  as  the  computation  progresses. 

After  each  time  step,  we  compute  Ej,  j  =  1, 2, . . J,  and  increase  the 
polynomial  degree  of  Uj  by  one  if  Ej  >  TOL.  The  solution  Uj  and  the 
error  estimate  are  recomputed  on  enriched  elements,  and  further  increases 
of  degree  occur  until  Ej  <  TOL  on  ail  elements. 

We  reduce  the  need  for  back-tracking  by  predicting  the  degree  of  the 
approximation  needed  to  satisfy  the  accuracy  requirements  during  the  next 
time  step.  After  a  time  step  is  accepted,  if  Ej  >  Hmax TOL,  Hmax  £  (0, 1], 
we  increase  the  degree  of  Uj(x,  t  +  At)  for  the  next  time  step.  If  Ej  < 
HminTOL,  Hmin  €  [0, 1),  we  decrease  the  degree  of  Uj(x,f  -f  At)  for  the 
next  time  step. 

Example  1.  We  demonstrate  the  accuracy  of  the  error  estimate  (3.4) 
in  terms  of  its  effectivity  index 


Estimated  Error 

^  "  Actual  Error 

for  the  two-dimensional  problem 

(3.6a)  ut  +  ux  +  uy  =  0,  -l<x,y<l,  t  >  0, 

with 

(3.6b)  u°(x,  y)  =  sin(xar)  sin(5rj/),  — l<i,y<l, 

and  periodic  boundary  conditions.  In  Table  3.1,  we  show  the  actual  errors 
and  effectivity  indices  with  p  =  0, 1,  and  2.  Each  time  the  mesh  is  refined, 
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the  time  step  is  halved,  and  the  number  of  time  steps  is  doubled.  Effectivity 
indices  are  near  unity  for  the  entire  range  of  computation  when  p  =  0.  For 
p  —  1  and  2,  the  error  estimate  improves  as  the  mesh  is  refined  since  the 
superconvergence  points  move  closer  to  the  Radau  points  after  each  time 
step. 


Number  of 
Elements 

Actual 

Error 

0 

p  =  0 

16  x  16 

2.66838e  -  1 

32  x  32 

64  x  64 

128  x  128 

256  x  256 

p  =  1 

16  x  16 

1.45948e-  2 

HS1 

32  x  32 

4.21090e  —  3 

EH 

64  x  64 

1.11300e  —  3 

0.975 

128  x  128 

2.79793e  —  4 

1.000 

256  x  256 

6.99557e  -  5 

1.000 

P  =  2 

16  x  16 

6.41413e  —  4 

0.557 

32  x  32 

9.68358e  —  5 

64  x  64 

9.68224e  -  6 

1.128 

128  x  128 

1.26721c -6 

1.009 

256  x  256 

1.58712e  —  7 

1.000 

512  x  512 

1.98384e  —  8 

1.000 

Table  3.1 

Errors  ani  effectivity  indices  9  at  t  =  0.025  using  (3.4)  for  Example  1. 


Example  2.  Consider 


(3.7a)  ut  +  2ux  +  2uy  =0,  0<x,y<l,  t>  0, 

with  initial  and  Dirichlet  boundary  conditions  specified  so  that  the  exact 
solution  is 

(3.7b)  u(x,y,t )  =  i(l  -  tanh(20a:  —  10p-  20 1  +  5)),  0  <  x,y  <  1. 

In  Figure  3.1,  we  show  the  exact  solution  of  (3.7)  at  time  t  =  0  and  the 
degrees  generated  on  a  adaptive  16  x  16-element  mesh  to  satisfy  the  initial 
data  when  TOL  =  10-5. 

We  solve  (3.7)  by  both  fixed-order  and  adaptive  p-refinement  methods 
on  0  <  t  <  0.1.  In  Figure  3.2,  we  show  the  global  Z^-error  versus  the 
CPU  time  for  fixed-order  methods  with  p  —  0,  1,  and  2  on  8  x  8,  16  x  16, 
32  x  32,  and  64  x  64-element  meshes,  and  the  p- adaptive  method  with 
Hmax  =  0.9,  Hmin  =  0.1,  and  TOL  ranging  from  5  x  10~9  to  5  x  10-4  on 
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Fig.  3.1.  Exact  solution  o]  (3.7)  at  t  =  0  and  degrees  generated  on  a  16  x  16-element 
mesh  with  TOL  =  10 ~s forExample  2. 

a  16  x  16-element  mesh.  The  adaptive  p-refinement  method  requires  more 
computation  than  the  fixed-order  methods  for  large  error  tolerances,  but 
because  of  its  increasing  convergence  rate,  it  requires  less  work  than  the 
fixed-order  methods  to  obtain  small  errors. 

4.  Dynamic  Load  Balancing  via  Tiling.  Tiling  [37,38]  is  a  modi¬ 
fication  of  the  global  load  balancing  technique  of  Leiss  and  Reddy  [24,29] 
who  used  local  balancing  within  overlapping  processor  neighborhoods  to 
achieve  a  global  load  balance.  A  neighborhood  is  defined  as  a  processor 
at  the  center  of  a  circle  of  some  predefined  radius  and  all  other  processors 
within  the  circle.  Processors  within  a  given  neighborhood  are  balanced 
with  respect  to  each  other  using  local  performance  measurements.  Individ¬ 
ual  processors  may  belong  to  several  neighborhoods.  Work  can  be  migrated 
from  a  processor  to  any  other  processor  within  the  same  neighborhood.  In 
tiling,  we  extend  the  definition  of  a  neighborhood  to  include  all  processors 
having  finite  elements  that  are  neighbors  of  elements  in  the  center  proces¬ 
sor  (see  Figure  4.1).  Tiling  neighborhoods  are  not  related  to  the  hardware 
interconnection  of  the  processors  as  were  the  neighborhoods  of  Leiss  and 
Reddy  [24].  Every  processor  is  the  center  of  one  neighborhood,  and  may 
belong  to  many  neighborhoods.  Elements  are  migrated  only  to  processors 
having  neighbors  of  the  migrating  elements. 

The  tiling  algorithm  consists  of  (i)  a  computation  phase  and  (ii)  a  bal¬ 
ancing  phase,  and  is  designed  to  be  independent  of  the  application.  The 
computation  phase  corresponds  to  the  application’s  implementation  with¬ 
out  load  balancing.  Each  processor  operates  on  its  local  data,  exchanges 


FlG.  3.2.  Convergence  of  the  adaptive  p-rep.nem.ent  method  and  pxed-order  methods 
with  p  =  0,1,  and  2  for  Example  2. 


inter-processor  boundary  data,  and  processes  the  boundary  data.  A  bal¬ 
ancing  phase  restores  load  balance  following  a  given  number  of  computation 
phases.  Each  balancing  phase  consists  of  the  following  operations: 

1.  Determine  work  loads.  Each  processor  determines  its  work  load 
as  the  time  to  process  its  local  data  since  the  previous  balancing 
phase  less  the  time  to  exchange  inter-processor  boundary  data  dur¬ 
ing  the  computation  phase.  Neighborhood  average  work  loads  are 
also  calculated. 

2.  Determine  processor  work  requests.  Each  processor  com¬ 
pares  its  work  load  to  the  work  load  of  the  other  processors  in  its 
neighborhood  and  determines  those  processors  having  loads  greater 
than  its  own.  If  any  are  found,  it  selects  the  one  with  the  greatest 
work  load  (ties  are  broken  arbitrarily)  and  sends  a  request  for  work 
to  that  processor.  Each  processor  may  send  only  one  work  request, 
but  a  single  processor  may  receive  several  work  requests. 

3.  Select  elements  to  satisfy  work  requests.  Each  processor  pri¬ 
oritizes  the  work  requests  it  receives  based  on  the  request  size,  and 
determines  which  elements  to  export  to  the  requesting  processor. 
Elemental  processing  costs  are  used  so  that  the  minimum  num¬ 
ber  of  elements  satisfying  the  work  request  are  exported.  (This 
approach  differs  from  Wheat  [37],  where  the  average  cost  per  ele¬ 
ment  is  used  to  determine  the  number  of  export  elements).  Details 
of  the  selection  algorithm  follow. 

4.  Notify  and  transfer  elements.  Once  elements  to  be  exported 
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FlG.  4.1.  Examples  of  12  processors  in  12  neighborhoods  using  the  Leiss/Reddy[24,29j 
(left)  and  the  tiling  (right)  definitions. 


_  have  been  selected,  the  importing  processors  and  processors  con¬ 
taining  neighbors  of  the  exported  elements  are  notified.  Importing 
processors  allocate  space  for  the  incoming  elements,  and  the  ele¬ 
ments  are  transferred. 

Each  processor  knows  the  number  of  computation  phases  to  perform 
before  entering  the  balancing  phase.  Synchronization  guarantees  that  all 
processors  enter  the  balancing  phase  at  the  same  time. 

The  technique  for  selecting  elements  gives  priority  to  elements  with 
neighbors  in  the  importing  processor  to  prevent  the  creation  of  “narrow, 
deep  holes”  in  the  element  structures.  Elements  are  assigned  priorities 
(initially  zero)  based  upon  the  locality  of  their  element  neighbors.  An 
element’s  priority  is  decreased  by  one  for  each  element  neighbor  in  its  own 
processor,  increased  by  two  for  each  neighbor  in  the  importing  processor, 
and  decreased  by  two  for  each  neighbor  in  some  other  processor.  Thus, 
elements  whose  neighbors  are  already  in  the  importing  processor  are  more 
likely  to  be  exported  to  that  processor  than  elements  whose  neighbors  are 
in  the  exporting  processor  or  some  other  processor.  When  an  element  hats 
no  neighboring  elements  in  its  local  processor,  it  is  advantageous  to  export 
it  to  any  processor  having  its  neighbors.  Thus,  “orphaned”  elements  are 
given  the  highest  export  priority.  When  two  or  more  elements  have  the 
same  priority,  the  processor  selects  the  element  with  the  largest  work  load 
that  does  not  cause  the  exported  work  to  exceed  the  work  request  or  the 
work  available  for  export. 

In  Figure  4.2,  we  illustrate  an  example  of  element  priorities  and  se¬ 
lection  for  satisfying  a  work  request  of  55  units  from  the  east  neighboring 
processor.  Initially,  elements  3,  6,  9,  and  12  are  eligible  for  export.  Their 
priorities  are  computed;  element  3,  for  example,  has  priority  -2,  since  it 
has  two  local  neighbors  (-2),  one  neighbor  in  the  importing  processor  (+2), 
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and  one  neighbor  in  some  other  processor  (-2).  Elements  6  and  9  share 
the  highest  priority,  but  element  6  is  selected  because  it  has  a  greater  work 
load.  Element  5  becomes  eligible  for  export,  but  its  priority  is  low  since 
it  has  three  local  neighbors.  The  priorities  are  adjusted,  and  element  9  is 
selected,  making  element  8  a  candidate.  The  priorities  are  again  updated, 
and  the  selection  process  continues  with  elements  3  and  12  being  selected. 
Although  the  work  request  is  not  completely  satisfied,  no  other  elements 
are  exported,  since  the  work  loads  of  the  elements  with  the  highest  priority, 
5  and  8,  are  greater  than  the  remaining  work  request. 

Example  3.  We  solve  (3.7)  with  a  fixed-order  method  ( p  =  3)  on  a 
32  x  32-element  mesh  and  tiling  on  16  processors  of  the  nCUBE/2  hyper¬ 
cube.  In  Figure  4.3  (left),  we  show  the  processor  domain  decomposition 
after  20  time  steps.  The  tiling  algorithm  redistributes  the  work  so  that  pro¬ 
cessors  containing  elements  on  the  domain  boundary  have  fewer  elements 
than  those  in  the  interior  of  the  domain.  The  global  error  of  the  numerical 
solution  is  4.76766  x  10~3.  The  total  processing  time  was  reduced  by  5.18% 
from  128.86  seconds  to  122.18  seconds  by  balancing  once  each  time  step. 
The  average/maximum  processor  work  ratio  without  balancing  is  0.858; 
with  balancing,  it  is  0.942.  Parallel  efficiency  is  increased  from  90.80% 
without  balancing  to  95.58%  with  tiling. 

We  then  solve  (3.7)  using  the  adaptive  p-refinement  method  on  a 
32  x  32-element  mesh  with  TOL  =  3.5  x  10~6  and  tiling  on  16  proces¬ 
sors.  In  Figure  4.3  (right),  we  show  the  processor  domain  decomposition 
after  20  time  steps.  The  shaded  elements  have  higher-degree  approxima¬ 
tions  and,  thus,  higher  work  loads.  The  tiling  algorithm  redistributes  the 
work  so  that  processors  with  high-order  elements  have  fewer  elements  than 
those  processors  with  low-order  elements.  The  global  error  of  the  adaptive 
solution  is  4.44426  x  10-3,  less  than  the  fixed-order  method  above.  The 
total  processing  time  for  the  adaptive  method  was  reduced  41.98%  from 
63.94  seconds  to  37.10  seconds  by  balancing  once  each  time  step.  The  av¬ 
erage/maximum  processor  work  ratio  without  balancing  is  0.362,  and  with 
balancing,  it  is  0.695.  Parallel  efficiency  is  increased  from  35.10%  without 
balancing  to  60.51%  with  tiling. 

Example  4.  We  solve  (3.7)  for  225  time  steps  on  all  1024  processors  of 
the  nCUBE/2  without  balancing  and  with  balancing  once  each  time  step. 
A  fixed-order  method  with  p  —  2  produces  a  solution  with  global  error 
6.40644  x  10-2.  Using  the  tiling  algorithm  reduced  the  total  execution 
time  6.25%  from  1601.96  seconds  without  balancing  to  1501.90  seconds 
with  balancing  (see  Table  4.1).  Parallel  efficiency  without  balancing  was 
82.7%;  with  balancing,  it  was  88.2%. 

The  adaptive  p-refinement  method  produced  a  solution  with  global 
error  6.32205  x  10-2,  comparable  to  the  fixed-order  solution.  With  bal¬ 
ancing,  the  maximum  computation  time  (not  including  communication  or 
balancing  time)  was  reduced  by  49.8%  (see  Table  4.1).  The  irregular  sub- 
domain  boundaries  created  by  the  tiling  algorithm  increased  the  average 
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Elements  3,  6,  9,  and  1 2  are  eligible  for  export.  Initial  work  request  =  55. 
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Element  6  is  selected  for  export  and  5  becomes  an  export  candidate. 

Work  request  =  42. 


work:  41 


work:  41 


work:  41 


work:  13  It  |  work:  5 


After  second  selection,  work  request  =  37. 


work:  25  2  work:  41  I  3  work:  25 

priority:  >2  I 


work:  41  5  I  work:  41- 1  6  1  work:  13 

mmriarmi  -l  I  - 1 


work:  41  I  8  work:  13  I  9  work:  5 

— priority:  -1  |  — 


work:  13  11  |  work:  5 


After  third  selection,  work  request  =  12. 


work:  25 


IB: 


work:  41 


work:  41 


work:  13 


After  fourth  selection,  work  request  =  7;  no  other  elements  are  exported. 
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s 

work:  25 

_ 
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_®j 

work:  5 

work:  5 

Fig.  4.2.  Example  of  element  priorities  and  the  export  element  selection  algorithm. 
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Fig.  4.3.  Processor  domain  decompositions  after  20  time  steps  for  Example  3  using 
fixed-order  (left)  and  adaptive  order  (right)  methods.  Dark  lines  represent  processor 
subdomain  boundaries. 


communication  time  by  2.5%.  Despite  the  extra  communication  time  and 
the  load  balancing  time,  however,  we  see  a  36.3%  improvement  in  the  total 
execution  time. 

In  Figure  4.4,  we  show  the  maximum  processing  costs  per  time  step,  in¬ 
cluding  the  computation  and  balancing  times,  for  the  adaptive  p-refinement 
method.  The  dashed  and  solid  lines  represent  the  maximum  cost  without 
and  with  balancing,  respectively.  The  balanced  computation’s  maximum 
cost  per  time  step  is  significantly  lower  than  without  balancing.  The  spikes 
in  both  curves  occur  when  the  error  tolerance  was  not  satisfied  on  some 
elements  and  the  adaptive  p-refinement  method  back-tracked  to  compute 
a  more  accurate  solution.  In  Figure  4.5,  we  show  the  cumulative  maxi¬ 
mum  processing  times  with  and  without  balancing.  The  immediate  and 
sustained  improvement  of  the  application’s  performance  is  shown. 
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Fixed-Order  (p=2)  Adaptive  p-refinement 
Global  Error:  0.06406  Global  Error:  0.06322 


Without 

Tiling 

With 

Tiling 

Without 

Tiling 

With 

Tiling 

Total  Execution 
Time  (seconds) 

1601.96 

1501.90 

858.50 

546.75 

Max.  Computation 
Time  (seconds) 

1549.77 

1429.24 

782.93 

393.32 

Average  /  M  aximum 
Work  Ratio 

0.855 

0.927 

0.427 

0.851 

Avg.  Communication 
Time  (seconds) 

59.09 

59.09 

70.85 

72.65 

Max.  Balancing 
Time  (seconds) 

0.0 

20.88 

0.0 

23.46 

Parallel 

Efficiency 

82.7% 

88.2% 

38.98% 

61.21% 

Table  4.1 


Performance  comparison  for  Example  4  using  fixed-order  and  adaptive  methods  without 
and  with  balancing  at  each  time  step. 


5.  Three-Dimensional  Mesh  Partitioning.  We  describe  a  tree- 
based  partitioning  technique  which  utilizes  the  hierarchical  structure  of 
octree-derived  unstructured  meshes  to  distribute  elemental  data  across  pro¬ 
cessors’  memories  while  reducing  the  amount  of  data  that  must  be  ex¬ 
changed  between  processors.  An  octree-based  mesh  generator  [30]  recur¬ 
sively  subdivides  an  embedding  of  the  problem  domain  in  a  cubic  universe 
into  eight  octants  wherever  more  resolution  is  required.  Octant  bisection 
is  initially  based  on  geometric  features  of  the  domain  but  solution-based 
criteria  are  introduced  during  an  adaptive  /i-refinement  process.  Finite 
element  meshes  of  tetrahedral  elements  are  generated  from  the  octree  by 
subdividing  terminal  octants. 

In  Figure  5.1,  we  illustrate  the  tree  and  mesh  for  a  two-dimensional 
flow  domain  containing  a  small  object.  The  root  of  the  tree  represents 
the  entire  domain  (Figure  5.1c).  The  domain  is  recursively  quartered  until 
an  adequate  resolution  of  the  object  is  obtained  (Figure  5.1a).  A  smooth 
gradation  is  maintained  by  enforcing  a  one-level  maximum  difference  be¬ 
tween  adjacent  quadrants.  After  appropriate  resolution  is  obtained,  leaf 
quadrants  are  subdivided  into  triangular  elements  that  are  pointed  to  by 
leaf  nodes  of  the  tree  (Figures  5.1b,c).  The  leaf  quadrant  containing  the 
object  must  be  decomposed  into  triangles  based  on  the  geometry  of  the 
object  boundary.  Smoothing,  which  normally  follows  element  creation,  is 
not  shown. 

Our  tree-based  based  partitioning  algorithm  creates  a  one-dimensional 
ordering  of  the  octree  and  divides  it  into  nearly  equal-sized  segments  based 
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FIG.  5.1.  A  quadtree  representation  of  the  flow  field  surrounding  an  object  (a),  division 
of  terminal  quadrants  into  triangular  elements  (b),  and  quadtree  structure  (c). 


on  tree  topology.  The  first  step  of  the  algorithm  is  the  determination  of 
cost  metrics  for  all  subtrees.  Cost  is  currently  defined  as  the  number  of 
elements  within  a  subtree.  For  a  leaf  octant,  this  would  simply  be  the 
number  of  tetrahedra  associated  with  it.  .P- refinement  would  necessitate 
the  inclusion  of  an  element’s  order  into  the  cost  function.  If  the  solution 
algorithm  employs  spatially-dependent  time  steps  then,  typically,  a  greater 
number  of  smaller  time  steps  must  be  taken  on  smaller  elements  and  this 
must  also  be  reflected  in  the  subtree  cost.  In  any  event,  appropriate  costs 
may  be  determined  by  a  postorder  traversal  of  the  octree. 

The  second  phase  of  the  partitioning  algorithm  uses  the  cost  informa¬ 
tion  to  construct  the  actual  partitions.  Since  the  number  of  partitions  is 
prescribed  and  the  total  cost  is  known  from  the  first  phase,  we  also  know 
the  optimal  size  of  each  partition.  Partitions  consist  of  a  set  of  octants 
that  are  each  the  root  of  a  subtree  and  are  determined  by  a  truncated 
depth-first  search.  Thus,  octree  nodes  are  visited  in  depth-first  order,  and 
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subtrees  are  accumulated  into  successive  partitions.  The  subtree  rooted 
at  the  visited  node  is  added  to  the  current  partition  if  it  fits.  If  it  would 
exceed  the  optimal  size  of  the  current  partition,  a  decision  must  be  made 
as  to  whether  it  should  be  added,  or  whether  the  traversal  should  examine 
it  further.  In  the  latter  case,  the  traversal  continues  with  the  offspring  of 
the  node  and  the  subtree  may  be  divided  among  two  or  more  partitions. 
The  decision  on  whether  to  add  the  subtree  or  examine  it  further  is  based 
on  the  amount  by  which  the  optimal  partition  size  is  exceeded.  A  small 
excess  may  not  justify  an  extensive  search  and  may  be  used  to  balance 
some  other  partition  which  is  slightly  undersized.  When  the  excess  at  a 
node  is  too  large  to  justify  inclusion  in  the  current  partition,  and  the  node 
is  either  terminal  or  sufficiently  deep  in  the  tree,  the  partition  is  closed  and 
subsequent  nodes  are  added  to  the  next  partition. 

This  partitioning  method  requires  storage  for  nonterminal  nodes  of  the 
tree  which  would  normally  not  be  necessary  since  they  contain  no  solution 
data.  However,  only  minimal  storage  costs  are  incurred  since  information 
is  only  required  for  tree  connectivity  and  the  cost  metric.  For  this  modest 
investment,  we  have  a  partitioning  algorithm  that  only  requires  0{J)  serial 
steps. 

Partitions  formed  by  this  procedure  do  not  necessarily  form  a  single 
connected  component;  however,  the  octree  decomposition  and  the  orderly 
tree  traversal  tend  to  group  neighboring  subtrees  together.  Furthermore,  a 
single  connected  component  is  added  to  the  partition  whenever  a  subtree 
fits  within  the  partition. 

A  tree-partitioning  example  is  illustrated  in  Figure  5.2.  All  subtree 
costs  are  determined  by  a  post  order  traversal  of  the  tree.  The  partition 
creation  traversal  starts  at  the  root,  Node  0  (Figure  5.2a).  The  node  cur¬ 
rently  under  investigation  is  identified  by  a  double  circle.  The  cost  of  the 
root  exceeds  the  optimal  partition  cost,  so  the  traversal  descends  to  Node 
1  (Figure  5.2b).  As  shown,  the  cost  of  the  subtree  rooted  at  node  1  is 
smaller  than  the  optimal  partition  size  and,  hence,  this  subtree  is  added 
to  the  current  partition,  pO,  and  the  traversal  continues  at  Node  2  (Fig¬ 
ure  5.2c).  The  cost  of  the  subtree  rooted  at  Node  2  is  too  large  to  add 
to  pO,  so  the  algorithm  descends  to  an  offspring  of  Node  2  (Figure  5. 2d). 
Assuming  Node  4  fits  in  pO,  the  traversal  continues  with  the  next  offspring 
of  Node  2  (Figure  5.2e).  Node  5  is  a  terminal  node  whose  cost  is  larger 
than  the  available  space  in  pO,  so  the  decision  is  made  to  close  pO  and  begin 
a  new  partition,  pi.  As  shown  (Figure  5.2f),  Node  5  is  very  expensive,  and 
when  the  traversal  is  continued  at  Node  3,  pi  must  be  closed  and  work 
continues  with  partition  p2. 

The  tree-traversal  partitioning  algorithm  may  easily  be  extended  for 
use  with  a  parallel  adaptive  environment.  An  initial  partitioning  is  made 
using  the  serial  algorithm  described  above.  As  the  numerical  solution  ad¬ 
vances  in  time,  h-  and/or  p-refinement  introduces  a  load  imbalance.  To 
obtain  a  new  partitioning,  let  each  processor  compute  its  subtree  costs  us- 


19 


FlG.  5.2.  A  tree  partitioning  example.  The  partition-creation  traversal  starts  at  the  root 
(a).  Nodes  are  visited  and  aided  to  the  current  partition  if  their  subtree  fits  (b).  When 
a  subtree  is  too  large  to  fit  (c),  the  traversal  descends  into  the  subtree  (d).  Alternatively, 
the  partition  is  closed  and  work  begins  on  a  new  partition  (e).  The  process  continues 
until  the  traversal  is  complete  (}). 


ing  the  serial  traversal  algorithm  within  its  domain.  This  step  requires 
no  interprocessor  communication.  An  inexpensive  parallel  prefix  operation 
may  be  performed  on  the  processor-subtree  totals  to  obtain  a  global  cost 
structure.  This  information  enables  a  processor  to  determine  where  its 
local  tree  traversal  is  located  in  the  global  traversal. 

Now,  following  the  serial  procedure,  each  processor  may  traverse  its 
subtrees  to  create  partitions.  A  processor  determines  the  partition  number 
to  start  working  on  based  on  the  total  cost  of  processors  preceeding  it. 
Each  processor  starts  counting  with  this  prefix  cost  and  traverses  its  sub¬ 
trees  adding  the  cost  of  each  visited  node  to  this  value.  Partitions  end  near 
cost  mutiples  of  N/P,  where  N  is  the  total  cost  and  P  is  the  number  of 
processors.  Exceeding  a  multiple  of  N/P  during  the  traversal  is  analagous 
to  exceeding  the  optimal  partition  size  in  the  serial  case  and  the  same  crite¬ 
ria  may  be  used  to  determine  where  to  end  partitions.  When  all  processors 
finish  their  traversals,  each  subtree  (and  its  associated  data)  is  assigned  to 
a  new  partition  and  may  be  migrated  to  its  new  location.  Migration  may  be 
done  using  global  communication;  however,  on  some  architectures,  it  may 
be  more  efficient  to  move  data  via  simultaneous  processor  shift  operations. 
This  linear  communication  pattern  is  made  possible  by  the  one-dimensional 
nature  of  the  partition  traversal. 

While  the  cost  of  computing  the  new  partition  is  small,  the  cost  of 
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FlG.  5.3.  Iterative  rebalancing  of  tree-based  partitions.  The  subtree  rooted  at  Node  4 
(a)  has  been  shifted  from  pO  to  pi  (b)  to  relieve  a  load  imbalance.  The  new  root  of  pi 
is  Node  2,  the  common  parent  of  Nodes  4  &nd  5. 


data  movement  is  likely  to  be  high  and  it  would  be  desirable  to  amor¬ 
tize  this  by  tolerating  small  imbalances.  A  strategy  to  delay  the  need  for 
complete  repartitioning  would  simply  shift  partition  boundaries,  thus,  mi¬ 
grating  subtrees  from  a  processor  P„  to  its  neighbors  Pn_i  and  Pn+i-  If, 
for  example,  processor  P„  seeks  to  transfer  cost  m  to  Pn-i,  it  simply  tra¬ 
verses  its  subtrees  accumulating  their  costs  until  it  reaches  m.  The  nodes 
visited  comprise  a  subtree  which  may  be  transferred  to  Pn_i  and  which  is 
contiguous  with  the  subtrees  in  Pn-i  ■  Likewise,  if  Pn  desires  to  transfer 
work  to  Pn+i,  the  reverse  traversal  could  remove  a  subtree  from  the  trail¬ 
ing  part  of  P„.  Consider,  as  an  example,  the  subtree  rooted  at  Node  4  of 
Figure  5.3a  and  suppose  that  its  cost  has  increased  through  refinement.  In 
Figure  5.3b,  we  show  how  the  partition  boundary  may  be  shifted  to  move 
the  subtree  rooted  at  Node  4  to  partition  pi.  The  amount  of  data  to  be 
moved  from  processor  to  processor  may  utilize  a  relaxation  algorithm  or 
the  tiling  procedure  discussed  in  Section  4. 

Example  5.  Performance  results  obtained  by  applying  the  tree-based 
mesh  partitioning  algorithm  to  various  three-dimensional  irregular  meshes 
are  presented  in  Figure  5.4.  The  meshes  were  generated  by  the  Finite  Oc¬ 
tree  mesh  generator  [30].  “Airplane”  is  a  182K-element  mesh  of  the  volume 
surrounding  a  simple  airplane  [17].  “Copter”  is  a  242K-element  mesh  of 
the  body  of  a  helicopter  [17].  “Onera,”  “Onera2,”  and  “Onera3”  are  16K-, 
70K-,  and  293K-element  meshes,  respectively,  of  the  space  surrounding  a 
swept,  untwisted  Onera-M6  wing  which  has  been  refined  to  resolve  a  bow 
shock  [18].  “Cone”  is  a  139K-element  mesh  of  the  space  around  a  cone  hav¬ 
ing  a  10°  half-angle  and  which  also  has  been  refined  to  resolve  a  shock. 

The  quality  of  a  partition  has  been  measured  as  the  percent  of  element 
faces  lying  on  inter-partition  boundaries  relative  to  the  total  number  of 
faces  of  the  mesh.  Graphs  in  Figure  5.4  display  these  percentages  as  a 
function  of  the  optimal  partition  size.  In  all  cases  the  cost  variance  between 
the  partitions  is  very  small  (about  as  small  as  the  maximum  cost  of  a  leaf 
octant).  The  proportion  in  Figure  5.4  is,  in  a  sense,  the  total  surface  area 


21 


c 

8 

5 

CL 


0 

0  10000  20000  30000  40000  50000  60000 


Optimal  Partition  Size 


FlG.  5.4.  Performance  of  the  tree  partitioning  algorithm  on  five  meshes:  large-scale 
(top)  and  small-scale  (bottom). 
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that  partitions  hold  in  common.  Smaller  ratios  require  less  communication 
relative  to  the  amount  of  local  data  access.  This  measure  is  closely  related 
to  the  number  of  “cuts”  that  the  partition  creates  [23,20,32];  however,  we 
have  chosen  to  normalize  by  the  total  number  of  faces  in  order  to  compare 
partition  quality  over  a  wide  range  of  mesh  sizes  and  number  of  partitions. 

In  large  scale  (top)  the  data  of  Figure  5.4  show  the  expected  behaviour 
that  the  interface  proportion  approaches  zero  as  the  partition  size  increases 
(due  to  the  number  of  partitions  approaching  unity).  Conversely,  as  the 
optimal  partition  size  approaches  unity  (due  to  number  of  partitions  ap¬ 
proaching  the  number  of  elements),  the  interface  proportion  goes  to  unity. 
Examination  of  the  small  scale  (bottom)  results  reveals  that  the  interface 
proportion  is  less  than  12%  when  the  partition  size  exceeds  1000  for  these 
meshes.  Interfaces  drop  to  below  9%  and  8%,  respectively,  for  partition 
sizes  of  2000  and  3000.  This  performance  is  comparable  to  recursive  spec¬ 
tral  bisection  [22]  but  requires  much  less  computation  (O(J)  as  opposed  to 

0(J2)[  27]). 

The  best  performance  occurred  with  the  helicopter  mesh,  which  was 
the  only  mesh  of  a  solid  object  (as  opposed  to  a  flow  field  surrounding 
an  object).  The  solid  can  easily  be  cut  along  its  major  axis  to  produce 
partitions  with  small  inter-partition  boundaries,  and  was  included  for  gen¬ 
erality.  The  lowest  performance  occurred  with  the  cone  mesh.  This  is  most 
likely  due  to  the  model  and  shock  region  being  conically  shaped,  which 
is  somewhat  at  odds  with  the  rectangular  decomposition  imposed  by  the 
octree. 

In  general,  inter-partition  boundaries  should  be  less  than  10%,  indi¬ 
cating  partition  sizes  of  2000  or  more.  This  minimum  partition  size  is  not 
an  excessive  constraint,  since  a  typical  three-dimensional  problem  employ¬ 
ing  a  two  million-element  mesh  being  solved  on  a  1024-processor  computer 
would  have  about  2000  elements  per  processing  element. 

Another  measure  of  partition  quality  is  the  percent  of  a  partition’s  ele¬ 
ment  faces  lying  on  inter-partition  boundaries  relative  to  the  total  number 
of  faces  in  that  partition.  This  number  is,  in  a  sense,  the  ratio  of  surface 
area  to  volume  of  a  partition.  For  our  example  meshes,  this  measure  was 
below  22%  and  18%,  respectively,  for  partition  sizes  of  1000  and  1500. 

Example  6.  In  Figure  5.5  we  show  partitions  of  several  meshes  from 
Example  5.  The  partitions  exhibit  a  blocked  structure;  however,  several 
partitions  of  the  airplane  mesh  appear  to  be  made  up  of  disconnected  com¬ 
ponents.  While  this  is  possible,  although  unlikely,  in  this  case  the  partitions 
appear  to  be  disconnected  because  the  display  is  a  two-dimensional  slice 
through  the  three-dimensional  domain. 

Example  7.  In  Figure  5.6  we  show  the  pressure  contours  of  a  Mach  2 
Euler  flow  (1.1, 2.6)  past  the  “Cone”  mesh  of  Example  5.  The  solution  em¬ 
ploys  van  Leer’s  flux  vector  splitting  (2.7)  and  was  computed  on  a  Thinking 
Machines  CM-5  with  128  processors.  Several  iterations  of  /i-refinement  were 
required  to  yield  this  mesh.  At  each  iteration,  elements  were  marked  with 
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Fig.  5.5.  The  airplane  mesk,  and  three  refinements  of  the  Onera  M6  wing  mesh,  all 
divided  into  32  partitions.  Each  color  represents  a  different  partition. 
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FIG.  5.6.  Shock  surface  and  pressure  contours  found  when  computing  the  Mach  2  flow 
past  a  cone  having  a  half-angle  of  10°  (top).  Partitions  of  the  mesh  into  16  (left)  and 
32  (right)  pieces  (bottom).  Each  color  represents  a  different  partition. 
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the  desired  tree  level  (either  larger  for  refinement,  or  smaller  for  coarsen¬ 
ing),  and  a  new  global  mesh  created  to  satisfy  these  constraints.  The  shock 
surface  and  pressure  contours  are  shown  above;  below  are  examples  of  how 
the  mesh  may  be  partitioned  for  16  and  32  processor  machines.  Each  color 
represents  membership  in  a  different  partition  (and,  hence,  residence  on  a 
different  processor). 

6.  Conclusion.  We  have  demonstrated  the  effectiveness  of  adaptive 
methods  for  solving  systems  of  hyperbolic  conservation  laws  on  massively 
parallel  computers.  Using  a  discontinuous  Galerkin  finite  element  method 
with  projection  limiting  of  moments  of  the  solution  within  an  element,  we 
can  model  problems  with  discontinuities  sharply  without  spurious  oscil¬ 
lations.  The  discontinuous  Galerkin  method  has  a  small  computational 
stencil,  enabling  its  efficient  implementation  on  massively  parallel  comput¬ 
ers.  Adaptive  p-  and  /i-refinement  methods  provide  faster  convergence  than 
traditional  methods,  but  their  nonuniform  work  loads  create  load  imbal¬ 
ance  on  parallel  computers,  reducing  the  parallel  efficiency  of  the  methods. 
We  correct  the  load  imbalance  by  using  a  dynamic  load  balancing  technique 
called  tiling  that  produces  a  global  balance  by  performing  local  balancing 
within  overlapping  neighborhoods  of  processors.  Using  tiling  and  adaptive 
p-refinement,  computation  of  a  two-dimensional  example  required  approxi¬ 
mately  one-third  as  much  time  as  a  fixed-order  computation  with  the  same 
global  accuracy.  In  three  dimensions,  we  have  demonstrated  the  effec¬ 
tiveness  of  a  tree-based  mesh  partitioning  algorithm  for  reducing  parallel 
communication  costs.  This  algorithm  performs  almost  as  well  as  recursive 
spectral  bisection,  but  requires  much  less  work  to  compute  a  partitioning. 

In  future  work,  we  will  combine  the  adaptive  h-  and  p-refinement  tech¬ 
niques  to  obtain  an  adaptive  hp- refinement  method  that  can  optimize  com¬ 
putational  effort  in  both  smooth  and  discontinuous  solution  regions.  We 
will  extend  the  tiling  algorithm  to  incorporate  the  changing  data  structures 
required  for  h- refinement,  and  experiment  with  load  balancing  strategies  for 
adaptive  /ip-refinement  meshes.  The  tree-based  partitioning  algorithm  will 
be  extended  to  operate  in  parallel,  and  we  will  experiment  with  dynamic 
rebalancing  strategies. 
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